In [1]:
import sys
sys.path.insert(0, "..")  # Makes revscoring importable when running this from revscoring/ipython
from revscoring.features import revision, diff, Feature, modifiers
from revscoring.datasources.revision import text as revision_text
from revscoring.extractors import APIExtractor
from mw import api
In [2]:
extractor = APIExtractor(api.Session("https://en.wikipedia.org/w/api.php"))
In [3]:
list(extractor.extract(123456789, [diff.chars_added]))
Out[3]:
In [4]:
chars_added_ratio = Feature("diff.chars_added_ratio",
                            lambda a, c: a / max(c, 1),  # Prevents divide by zero
                            depends_on=[diff.chars_added, revision.chars],
                            returns=float)
list(extractor.extract(123456789, [chars_added_ratio]))
Out[4]:
There are easier ways to do this, though. I've overloaded the simple mathematical operators so that you can do simple math with features and get a feature back. The code below roughly corresponds to what's going on above.
In [5]:
chars_added_ratio = diff.chars_added / modifiers.max(revision.chars, 1) # Prevents divide by zero
list(extractor.extract(123456789, [chars_added_ratio]))
Out[5]:
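To make the operator overloading idea concrete, here is a minimal, self-contained sketch of how dividing two feature-like objects could return a new feature. This is not revscoring's actual code: `SketchFeature`, `sketch_max` and the `solve` method are hypothetical names made up for illustration, and the values come from a plain dict rather than from the API.

```python
class SketchFeature:
    """Hypothetical stand-in for a feature: a name, a function, and dependencies."""

    def __init__(self, name, process, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = list(depends_on)

    def __truediv__(self, other):
        # Dividing two features yields a new feature that depends on both.
        return SketchFeature(
            "({0} / {1})".format(self.name, other.name),
            lambda a, b: a / b,
            depends_on=[self, other],
        )

    def solve(self, cache):
        # Leaf features read their value from the cache; composed features
        # solve their dependencies first and pass the results to process().
        if not self.depends_on:
            return self.process(cache)
        return self.process(*(d.solve(cache) for d in self.depends_on))


def sketch_max(feature, floor):
    """Hypothetical analogue of a max modifier: clamp a feature from below."""
    return SketchFeature(
        "max({0}, {1})".format(feature.name, floor),
        lambda value: max(value, floor),
        depends_on=[feature],
    )


chars_added = SketchFeature("chars_added", lambda cache: cache["chars_added"])
chars = SketchFeature("chars", lambda cache: cache["chars"])
ratio = chars_added / sketch_max(chars, 1)  # Prevents divide by zero

print(ratio.name)
print(ratio.solve({"chars_added": 10, "chars": 40}))
print(ratio.solve({"chars_added": 0, "chars": 0}))
```

The point of the design is that the division does no arithmetic when it is written. It only builds a new object that remembers its two operands, so the math runs later, once values for a specific revision are available.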
There's also a set of datasources that are part of the dependency injection system. See revscoring/revscoring/datasources. I'll need to rename the diff datasource when I import it because of the name clash. FWIW, you usually don't use features and datasources in the same context, so there's some name overlap.
In [6]:
from revscoring.datasources import diff as diff_datasource
list(extractor.extract(662953550, [diff_datasource.added_segments]))
Out[6]:
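If the dependency injection idea is unfamiliar, the sketch below shows the general pattern under some loud assumptions: `Dependent`, `solve` and every datasource name here are hypothetical and are not taken from revscoring. Each node declares what it depends on, a solver fills in the arguments, and a cache keeps a shared datasource from being computed twice.

```python
class Dependent:
    """Hypothetical node: a named function plus the nodes it depends on."""

    def __init__(self, name, process=None, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = list(depends_on)


def solve(dependent, cache):
    """Resolve a node, reusing any value that is already in the cache."""
    if dependent.name in cache:
        return cache[dependent.name]
    if dependent.process is None:
        # Root nodes have no function, so their value must be injected.
        raise KeyError("No value injected for " + dependent.name)
    args = [solve(d, cache) for d in dependent.depends_on]
    cache[dependent.name] = dependent.process(*args)
    return cache[dependent.name]


calls = []  # Records how often the shared datasource actually runs.


def find_added_words(parent_text, text):
    calls.append("added_words")
    parent_words = set(parent_text.split())
    return [word for word in text.split() if word not in parent_words]


# Two injected roots, one shared datasource, and two features built on it.
parent_text = Dependent("parent_text")
text = Dependent("text")
added_words = Dependent("added_words", find_added_words, [parent_text, text])
words_added = Dependent("words_added", len, [added_words])
longest_word_added = Dependent(
    "longest_word_added",
    lambda words: max(map(len, words), default=0),
    [added_words],
)

cache = {"parent_text": "the quick fox", "text": "the quick brown fox jumps"}
print(solve(words_added, cache))
print(solve(longest_word_added, cache))
print(calls)
```

Both features ask for the same datasource, but because the solver caches by name, `find_added_words` only runs once per cache.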
OK. Let's define a new feature for counting the number of templates added. I'll make use of mwparserfromhell to do this. See the docs.
In [7]:
import mwparserfromhell as mwp
templates_added = Feature("diff.templates_added",
                          # Sum the templates found in each added segment
                          lambda add_segments: sum(len(mwp.parse(s).filter_templates()) for s in add_segments),
                          depends_on=[diff_datasource.added_segments],
                          returns=int)
list(extractor.extract(662953550, [templates_added]))
Out[7]:
In [8]:
from revscoring.dependent import draw
draw(templates_added)
In the tree structure above, you can see how our new feature depends on "diff.added_segments", which depends on "diff.operations", which depends (as you might imagine) on the current and parent revision. Other features are a bit more complicated.
In [9]:
draw(diff.added_badwords_ratio)
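For a feel of what drawing a dependency tree involves, here is a small sketch that walks a nested structure and indents each level. It is not how `draw` is implemented: `draw_tree` is a hypothetical helper, the tree is written out by hand as `(name, children)` tuples, and the two leaf names are illustrative labels for the current and parent revision rather than real datasource names.

```python
def draw_tree(node, depth=0):
    """Return one indented line per node, with parents before children."""
    name, children = node
    lines = ["  " * depth + "- " + name]
    for child in children:
        lines.extend(draw_tree(child, depth + 1))
    return lines


# Hand-written tree mirroring the chain described above (leaf names assumed).
tree = (
    "diff.templates_added",
    [
        (
            "diff.added_segments",
            [
                (
                    "diff.operations",
                    [
                        ("parent_revision.text", []),
                        ("revision.text", []),
                    ],
                )
            ],
        )
    ],
)

tree_lines = draw_tree(tree)
print("\n".join(tree_lines))
```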